Abstract

Intuitively, the appearance of true object boundaries varies from
image to image. Hence the usual monolithic approach of training a
single boundary predictor and applying it to all images regardless of
their content is bound to be suboptimal. In this paper we therefore
propose situational object boundary detection: we first define a
variety of situations and train a specialized object boundary detector
for each of them using [10]. Then, given a test image, we classify it
into these situations using its context, which we model by global
image appearance. We apply the corresponding situational object
boundary detectors and fuse them based on the classification
probabilities. In experiments on ImageNet [35], Microsoft COCO [24],
and Pascal VOC 2012 segmentation [13] we show that our situational
object boundary detection gives significant improvements over a
monolithic approach. Additionally, our method substantially
outperforms [17] on semantic contour detection on their SBD dataset.

1. Introduction 

Most methods for object boundary detection are monolithic and use a
single predictor to predict all object boundaries in an image
[2, 10, 23] regardless of the image content. But intuitively, the
appearance of object boundaries is dependent on what is depicted in
the image. For example, black-white transitions are often good
indicators of object boundaries, unless the image depicts a zebra as
in Figure 1. Outdoors, the sun may cast shadows which create strong
contrasts that are not object boundaries, while similar colour
contrasts in an indoor environment with diffuse lighting may be caused
by object boundaries. Furthermore, not all objects are equally
important in all circumstances: one may want to detect the boundary
between a snowy mountain and the sky in images of winter holidays,
while ignoring sky-cloud transitions in images depicting air balloons,
even though such boundaries may be visually very similar. These
examples show that one cannot expect a monolithic predictor to
accurately predict object boundaries in all situations.

In this work we recognize the need for different object boundary
detectors in different situations: first we define a set of situations
and pre-train object boundary detectors for each of them. For a test
image, we classify which situations the image depicts based on its
context, modelled by global image appearance. Then we apply the
appropriate set of object boundary detectors. Hence, conditioned on
the situation of an image, we choose which object boundary detectors
to run. We call this Situational Object Boundary Detection.

Figure 1. Monolithic vs situational object boundary detection.
Black-white transitions indicate an object boundary for the snowboard,
but are false object boundaries for a zebra. This ambiguity cannot be
resolved by a monolithic detector. In contrast, by training class
specific object boundary detectors and classifying the image as a
zebra, we correctly ignore most of the stripes.

One important question is how to define such situations. Since the
appearance of object boundaries is for a large part dependent on the
object class, one natural choice is to use each object class as a
single situation. This results in class specific object boundary
detectors, which can deal for example with the zebra in Figure 1.
However, object boundaries are also determined by the object pose and
the background or context of the image. Since this can vary within a
single object class, we propose to cluster images of a single class
into subclasses based on global image appearance. This leads to
subclass specific object boundary detectors. Finally, one can imagine
that the context of the image itself determines what kind of object
boundaries to expect. For example, one can expect cow-grass boundaries
in the countryside and street-car boundaries in the city. Therefore we
cluster images based on their global image appearance, which results
in class agnostic object boundary detectors. Hence we experiment with
three types of situations: class specific, subclass specific, and
class agnostic.

Obviously, situational object boundary detection requires more
training data than a monolithic approach. Therefore we cannot use the
standard BSD500 [2] dataset of 500 images for our evaluation. Instead,
we evaluate on three larger datasets: Pascal VOC 2012 segmentation
[13], Microsoft COCO [24], and part of ImageNet [35]. Microsoft COCO
is two orders of magnitude larger than BSD500. For ImageNet we train
from segments which are created in a semi-supervised fashion by
Guillaumin et al. [16].

Additionally, our class-specific situational object boundary detectors
can also be applied to semantic contour detection, the task of
predicting class-specific object boundaries [17]. We compare with [17]
on their SBD dataset.

2. Related Work 

Manually defined predictors. Early work on object boundary detection
aimed to manually define local filters to generate edges from an
image. In these works, convolutional derivative filters are applied to
find local image gradients [12, 32, 34] and their local maxima
[6, 28].

Trained predictors. But object boundaries arise from a complex
combination of local cues. Therefore more recent techniques resort to
machine learning and datasets with annotated object boundaries: Martin
et al. [29] compute local brightness, colour, and texture cues, which
they combine using a logistic model. Both Mairal et al. [27] and
Prasad et al. [31] use RGB-features from local patches centred on
edges found by the Canny edge detector [6], which they classify as
true or false positives. Dollar et al. [9] use boosted decision trees
to predict if the centre label of an image patch is an object boundary
or not. Lim et al. [23] use Random Forests [5] to predict sketch
tokens, which are object boundary patches generated by k-means
clustering. Dollar and Zitnick [10] proposed structured random
forests, which use object boundary patches as structured output labels
inside a random forest. Their method is extremely fast and yields
state-of-the-art results. We build on [10] in our paper.

Domain specific predictors. Some works that use machine learning to
predict object boundaries observed that this enables tuning detectors
to specific domains. Dollar et al. [9] showed qualitative examples of
domain-specific detectors for finding mouse boundaries in a laboratory
setting and detecting streets in aerial images. Both [27] and [31]
used class-specific object boundary detectors for boundary-based
object classification. Whereas in all these cases the domain was
predefined, in this work we automatically choose which object boundary
detector to apply at runtime.

Semantic contour detection. Like [27] and [31], Hariharan et al. [17]
addressed class-specific object boundary detection. They call this
‘semantic contour detection’ and create the SBD benchmark to directly
evaluate this task. Their method combines a monolithic object boundary
detector (gPb [2]) with object class detectors (Poselets [4]). Since
the class-specific version of our situational object boundary
detection can readily be applied to semantic contour detection, we
compare to [17] in Section 4.4.

Globally constrained predictors. Instead of predicting boundaries only
at a local level, Arbelaez et al. [2] cast the problem into a global
optimization framework capturing non-local properties in the spirit of
Normalized Cuts [36]. In this paper we use the global image appearance
to determine the set of local object boundary predictors to use. In
this sense, the global appearance of the image restricts our algorithm
to a limited set of expected object boundaries.

Contextual guidance. Context, as modelled by global image appearance,
has been successfully used to guide a variety of computer vision
tasks. Torralba et al. [38] showed that global image features
effectively constrain both the object class and its location, which is
frequently used in object localisation (e.g. [13, 14, 18]). Boix et
al. [3] do semantic segmentation by region prediction, where the
global image appearance enforces a consistency potential in their
hierarchical CRF. Liu et al. [25] perform semantic segmentation
through label transfer. Given a test image, they retrieve nearest
neighbours from a pixel-wise annotated dataset using global image
appearance. After region alignment, they transfer labels to the test
image. In this paper we use context modelled by global image features
to select those object boundary detectors that correspond to the
situation depicted in the image.

3. Method 

3.1. Situational Object Boundary Detection 

Our main idea is visualized in Figure 2. For each specific situation,
one can train a specialized object boundary detector. Given a test
image, one then only needs to apply those boundary detectors which
best fit its situation. Intuitively, the global image appearance can
help distinguish the local appearance of true object boundaries from
edges caused by other phenomena.

Formally, let $\mathcal{D} = \{D_1, \ldots, D_k\}$ be a set of $k$
trained object boundary detectors for a corresponding set of $k$
situations $S = \{S_1, \ldots, S_k\}$. Applying the $j$-th detector
$D_j$ to image $I$ gives the boundary prediction $D_j(I)$. We write
the probability that image $I$ corresponds to situation $S_j$ as
$P(S_j \mid I)$, which we obtain using global image classification as
explained in Section 3.3. Now we get the final object boundary
prediction $\mathcal{D}(I)$ by:

$$\mathcal{D}(I) = \sum_{j=1}^{k} P(S_j \mid I) \cdot D_j(I) \qquad (1)$$

Figure 2. Overview of situational object boundary detection. For each
situation there is a specialised boundary detector $D_j$ which we
apply by $D_j(I)$. The specialised predictions vary greatly and are
combined into a final prediction using Equation (2).


Of course, we do not need to apply all object boundary detectors to
the image, since $P(S_j \mid I)$ is likely to be small for most
situations $j$. To reduce computational costs we take the top few
$n \ll k$ situations for which $P(S_j \mid I)$ is highest. Formally,
let $\hat{S} = \{\hat{S}_1, \ldots, \hat{S}_k\}$ be an ordered set for
a specific image $I$ such that
$P(\hat{S}_i \mid I) > P(\hat{S}_j \mid I)$ for all $i < j$. Let
$\hat{\mathcal{D}}$ be the set of boundary detectors corresponding to
$\hat{S}$. Then the final object boundary prediction is obtained by:

$$\mathcal{D}(I) = \frac{1}{Z} \sum_{i=1}^{n} P(\hat{S}_i \mid I) \cdot \hat{D}_i(I) \qquad (2)$$

where $Z = \sum_{i=1}^{n} P(\hat{S}_i \mid I)$ is a normalizing factor
ensuring that the values of the predicted boundaries are comparable
for different $n$ and across images.

We have two choices for n: either we fix n or we take 
n such that Z > m for a specific probability mass m. We 
determine the best solution experimentally in section 4.1. 
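
The fusion of Equation (2) and the two ways of choosing n can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`top_n_by_count`, `top_n_by_mass`, `fuse`) are hypothetical, boundary maps are represented as plain nested lists, and each detector is assumed to be a callable that maps an image to such a map.

```python
from typing import Callable, List, Sequence

BoundaryMap = List[List[float]]  # per-pixel boundary strength (assumed representation)


def ranked_situations(probs: Sequence[float]) -> List[int]:
    """Situation indices sorted by decreasing P(S_j | I)."""
    return sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)


def top_n_by_count(probs: Sequence[float], n: int) -> List[int]:
    """First option: keep a fixed number n of the most probable situations."""
    return ranked_situations(probs)[:n]


def top_n_by_mass(probs: Sequence[float], m: float) -> List[int]:
    """Second option: keep the shortest ranked prefix whose mass Z exceeds m."""
    chosen: List[int] = []
    mass = 0.0
    for j in ranked_situations(probs):
        chosen.append(j)
        mass += probs[j]
        if mass > m:
            break
    return chosen


def fuse(image,
         probs: Sequence[float],
         detectors: Sequence[Callable[[object], BoundaryMap]],
         selected: Sequence[int]) -> BoundaryMap:
    """Probability-weighted average of the selected detector outputs (Equation 2)."""
    z = sum(probs[j] for j in selected)  # normalizing factor Z
    fused: BoundaryMap = []
    for j in selected:
        prediction = detectors[j](image)
        if not fused:
            fused = [[0.0] * len(row) for row in prediction]
        for r, row in enumerate(prediction):
            for c, value in enumerate(row):
                fused[r][c] += probs[j] * value / z
    return fused
```

With `selected = top_n_by_count(probs, 5)` this corresponds to fixing n, whereas `top_n_by_mass(probs, m)` lets n vary per image.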

3.2. Situations 


For situational object boundary detection to work, the 
key is to define proper situations. We propose three ways to 
define our situations as visualised in Figure 3: class specific, 
subclass specific, and class agnostic. 

Class specific. As the term itself suggests, object boundaries are
caused by the presence of an object. A logical way to define a
situation is therefore to use class specific situations, leading to
class specific boundary detection. We use class labels from the
dataset to obtain these situations.

Class specific situations constrain the appearance of object
boundaries in two ways. Most importantly, instances of the same class
tend to have similar appearance: in Figure 3a the boundaries of a
baboon are all a specific type of fur, while air balloons have a
characteristic oval shape. Second, objects often occur in similar
contexts: killer-whales are mostly in the water while balloons are
often in the air. If both the context and object class are the same,
there is little variation in the appearance of object boundaries and
one can learn an object boundary detector which is sensitive to these
specific object boundaries.

Subclass specific. For some classes, instances are depicted in a
variety of contexts, poses, and from a variety of viewpoints, which
can significantly influence the appearance of the object boundaries.
Take for example the killer-whale in Figure 3b. Photographed in the
wild the object boundaries are only caused by water-whale transitions,
while in a whale-show object boundaries can also be caused by
crowd-whale transitions. Furthermore, spurious edges caused by the
crowd should not yield object boundaries here. Additionally, a
viewpoint from within the water or from above the water causes the
object boundaries to be very different due to colour changes and
absence/presence of foaming water or waves. Pose may also affect
object boundary appearance: a sleeping, curled-up cat has much
smoother boundaries than a playing cat.

We create subclass specific situations by taking all images of a
certain class, modelling their global image appearance as described in
Section 3.3, and applying k-means clustering.
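
The clustering step can be illustrated with a small, dependency-free k-means. This is only a sketch: the paper does not specify initialization, number of iterations, or distance function, so random initialization from the data, a fixed iteration count, and squared Euclidean distance are assumptions here, and the names `kmeans` and `subclass_situations` are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centre(feature: Vector, centres: List[List[float]]) -> int:
    return min(range(len(centres)), key=lambda c: squared_distance(feature, centres[c]))


def kmeans(features: Sequence[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd iterations; returns one cluster index per feature vector."""
    rng = random.Random(seed)
    centres = [list(f) for f in rng.sample(list(features), k)]
    for _ in range(iterations):
        assignment = [nearest_centre(f, centres) for f in features]
        for c in range(k):
            members = [f for f, a in zip(features, assignment) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest_centre(f, centres) for f in features]


def subclass_situations(features_by_class: Dict[Hashable, Sequence[Vector]],
                        subclasses_per_class: int) -> Dict[Tuple[Hashable, int], List[int]]:
    """Cluster each class separately; each (class, cluster) pair is one situation."""
    situations: Dict[Tuple[Hashable, int], List[int]] = {}
    for label, features in features_by_class.items():
        for index, cluster in enumerate(kmeans(features, subclasses_per_class)):
            situations.setdefault((label, cluster), []).append(index)
    return situations
```

Running `kmeans` on the features of all training images at once, ignoring class labels, would give class agnostic situations in the same way.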

Class agnostic. Finally, the appearance of object boundaries may be
more influenced by context than by the object class itself. For
example, as visualised in Figure 3c, photographs taken through a fence
yield spurious edges which are not object boundaries. Detecting such a
situation allows for using an object boundary detector which ignores
edges from this fence. Furthermore, various object classes occur in
similar contexts and share characteristics. Indeed, the second row
shows furry animals in a forest environment, giving rise to a similar
appearance of object boundaries.

Therefore the last situation type we consider is class agnostic. We
ignore all class labels and cluster all images of the training set
using k-means on global image appearance. As shown in Figure 3c, this
leads to clusters of objects in similar contexts, some with
predominantly instances of a single class.

3.3. Image Classification 

For each situation $S_j \in S$ we need to predict $P(S_j \mid I)$.
We do this using either Bag of Visual Words [8, 37] or Convolutional
Neural Net (CNN) features [21].

Bag of Visual Words. We extract SIFT descriptors [26] of 16 × 16
pixels on a dense regular grid [20] at every 4 pixels using [39]. We
use PCA to reduce SIFT to 84 dimensions. We train a GMM with diagonal
covariance of 64 clusters. We then create Fisher Vectors following
[30]: we use derivatives only with respect to the means and standard
deviations of the GMM. Vectors are normalized by taking the square
root while keeping the sign, followed by L2 normalization. We use a
spatial partitioning [22] using the whole image and a division into
three horizontal regions (e.g. [39]). The final Fisher representation
has 43008 dimensions.
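
The stated dimensionality follows from the settings above: two gradient blocks (means and standard deviations) for each of the 84 descriptor dimensions and 64 Gaussians, over 4 spatial regions (the whole image plus three horizontal strips). The sketch below checks this arithmetic and illustrates the signed square root followed by L2 normalization. The function names are hypothetical, and whether normalization is applied per region or to the concatenated vector is not stated in the text; the sketch simply normalizes whatever vector it is given.

```python
import math
from typing import List, Sequence


def fisher_vector_dimension(descriptor_dim: int, num_gaussians: int, num_regions: int) -> int:
    """Gradients w.r.t. means and standard deviations give 2 blocks per Gaussian."""
    return 2 * descriptor_dim * num_gaussians * num_regions


def power_and_l2_normalize(vector: Sequence[float]) -> List[float]:
    """Signed square root of every entry, then scaling to unit Euclidean length."""
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in vector]
    norm = math.sqrt(sum(v * v for v in rooted))
    return [v / norm for v in rooted] if norm > 0 else rooted


# 84 PCA dimensions, 64 Gaussians, 1 whole-image region + 3 horizontal regions
print(fisher_vector_dimension(84, 64, 4))
```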





























(a) Class specific (b) Subclass specific (c) Class agnostic 

Figure 3. Visualisation for the three types of situations used in this paper. Each row per subfigure depicts three example images of a single
situation on ImageNet (Section 4.1). Figure 3a shows class specific situations, where each situation is a single object class. Figure 3b shows
subclass specific situations, beneficial for classes with significant context or pose variation such as the killer-whale. Finally, Figure 3c
shows class agnostic situations, which result in contextually similar clusters, some containing predominantly images of a single class.


CNN features. We use the publicly available software for deep Neural
Networks of Jia et al. [19]. Instead of training a specialized network
for each dataset, we choose the more flexible option of using a
pre-trained network, removing the final classification layer, and
using the last layer as global image appearance features. This was
shown to yield excellent features by e.g. [11, 15, 33].

In particular, we use the pre-trained network modelled 
after Krizhevsky [21] that comes with [19], trained on the 
training set of the ILSVRC classification task [35]. This 
network takes as input RGB images rescaled to 227 x 227 
pixels. It consists of five convolutional layers, two fully 
connected layers, and a final classification layer which we 
discard. Hence we use the outputs of the 7-th layer as CNN 
features, yielding features of 4096 dimensions. 

Classification. For both the Fisher Vectors and CNN features, we train
linear SVMs with Stochastic Gradient Descent using [40]. We use
cross-validation to optimize the slack-parameter and, following [1],
to optimize the relative sampling frequency of positive examples.

3.4. Boundary Detector 

As boundary detector we use the Structured Edge Forests of Dollar and
Zitnick [10], as these are extremely fast and yield state-of-the-art
performance. Using their standard settings, their detector predicts
16 × 16 pixel boundary masks from 32 × 32 pixel local image patches.
From each local image patch a variety of colour and gradient features
is extracted. They train a random forest directly on the structured
output space of segmentation masks: at each node they sample 256
random pixel pairs and perform binary tests checking if both pixels
come from the same segment. The resulting 256 dimensional vector is
reduced to a single dimension using PCA, where its sign is used as a
binary label. This allows for the calculation of information gain as
usual.
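
To make the pairwise encoding concrete, the sketch below maps a segmentation mask to a binary vector of same-segment tests. It is an illustration of the idea described above, not the code of [10]: the function names are hypothetical, the mask is a nested list of segment ids, the same sampled pixel pairs are assumed to be reused for all patches reaching a node so that the codes are comparable, and the subsequent PCA projection to one dimension is omitted.

```python
import random
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]


def sample_pixel_pairs(height: int = 16, width: int = 16,
                       num_pairs: int = 256, seed: int = 0) -> List[Tuple[Pixel, Pixel]]:
    """Draw random pairs of pixel coordinates inside a height x width mask."""
    rng = random.Random(seed)
    return [((rng.randrange(height), rng.randrange(width)),
             (rng.randrange(height), rng.randrange(width)))
            for _ in range(num_pairs)]


def encode_mask(mask: Sequence[Sequence[int]],
                pairs: Sequence[Tuple[Pixel, Pixel]]) -> List[int]:
    """1 if both pixels of a pair carry the same segment id, else 0."""
    return [1 if mask[r1][c1] == mask[r2][c2] else 0
            for (r1, c1), (r2, c2) in pairs]
```

Two masks with similar segment layouts yield similar binary codes, which is what makes a low-dimensional projection of these codes usable as a stand-in label at a tree node.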

Unless mentioned otherwise, we use their framework 
with standard settings except for the number of training 
patches. We lower these from 1 million to 300,000 resulting 
in similar performance as shown in Section 4.1. 


4. Results 

In Sections 4.1 to 4.3, we evaluate our method on object boundary
detection on ImageNet [35], Microsoft COCO [24], and Pascal VOC 2012
segmentation [13]. We use the evaluation software of [29], average
results over all images and report precision/recall curves, precision
at 20% and 50% recall, and average precision (AP).
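
For readers unfamiliar with these summary numbers, the sketch below shows one common way to read them off a precision/recall curve: linear interpolation for precision at a given recall, and the trapezoidal area under the curve for AP. This is an assumption made for illustration; the exact procedure of the evaluation software of [29] may differ, and the function names are hypothetical.

```python
from typing import Sequence


def precision_at_recall(recalls: Sequence[float], precisions: Sequence[float],
                        target: float) -> float:
    """Linearly interpolate the precision at a target recall."""
    points = sorted(zip(recalls, precisions))
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        if r0 <= target <= r1:
            if r1 == r0:
                return max(p0, p1)
            t = (target - r0) / (r1 - r0)
            return p0 + t * (p1 - p0)
    raise ValueError("target recall lies outside the measured range")


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Trapezoidal area under the precision/recall curve."""
    points = sorted(zip(recalls, precisions))
    return sum((r1 - r0) * (p0 + p1) / 2.0
               for (r0, p0), (r1, p1) in zip(points, points[1:]))
```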

In Section 4.4, we evaluate our method on semantic contour detection
on the SBD database [17] using their evaluation software and report
average precision (AP).

4.1. ImageNet 

Dataset. While ImageNet has no manually annotated object boundaries,
Guillaumin et al. [16] obtained good segmentations using a
semi-supervised segmentation transfer strategy, applied to
increasingly difficult image subsets. As our training set, we use
their most reliable segmentations created from bounding box
annotations. As test set, we use the ground-truth segmentations
collected by [16].

To keep evaluation time reasonable we randomly sample 100 classes from
the set of [16]. This results in 23,457 training and 1,000 test
images. Since each image is annotated with one object class, this
experiment evaluates only boundaries of that class.

Number of situations. For subclass specific situations, 
we choose to cluster classes into 10 subclasses, yielding 
1000 situations. For good comparison, we choose to also 
have the same number of 1000 class agnostic situations. 

Number of detectors at test time. We now establish the number of
object boundary detectors to apply to get optimal performance using
Equation (2). Table 1 shows results when varying n for subclass
specific object boundary detection (other situations yield similar
results). As can be seen, starting from n = 5 results saturate for
both methods. Looking at the probability mass Z, at n = 5 it is 61%
for Fisher vectors and 71% for CNN features. However, Z greatly
differs per image. Hence for stable and efficient computation time
with optimal performance, we fix n = 5 random forest detectors (of 8
trees) for all subsequent experiments.

                                   n = 1    n = 3    n = 5    n = 25
Z  - CNN - subclass specific       47%      65%      71%      85%
Z  - Fisher - subclass specific    29%      51%      61%      79%
AP - CNN - subclass specific       0.274    0.289    0.296    0.295
AP - Fisher - subclass specific    0.267    0.283    0.290    0.291
AP - Monolithic                    0.258    0.259    0.260    0.260

Table 1. Influence of the number of situational object boundary
detectors applied at test time. Results saturate in average precision
(AP) after applying 5 object boundary detectors.

                               precision at    precision at    average
                               20% recall      50% recall      precision
monolithic                     0.382           0.282           0.260
CNN - class specific           0.435           0.311           0.289
CNN - subclass specific        0.451           0.317           0.296
CNN - class agnostic           0.446           0.315           0.295
Fisher - class specific        0.426           0.305           0.283
Fisher - subclass specific     0.442           0.312           0.290
Fisher - class agnostic        0.429           0.307           0.284
GT - class specific            0.433           0.311           0.290
monolithic - CNN enhanced      0.385           0.278           0.259

Table 2. Results on ImageNet show that situational object boundary
detection significantly outperforms a monolithic strategy.

Baseline. Our baseline (monolithic) is a single monolithic detector.
However, for a fair comparison our baseline should be trained on the
same number of training patches and use the same number of decision
trees. This is equivalent to training multiple monolithic detectors
[10]. As shown in Table 1, results are affected little by training
more monolithic detectors, and stabilize at n = 5 at 0.260 AP.

We also trained a random forest with the recommended 1M training
examples [10] instead of 300k. This yields 0.262 AP. Since this is not
significantly different, for consistency of all experiments we choose
as baseline n = 5 random forests trained on 300k examples per tree.

Situational Object Boundary Detection. Figure 4 and Table 2 show that
situational object boundary detection significantly outperforms the
monolithic approach. Using CNN features, at 20% recall, the precision
for monolithic is 0.38, while it is respectively 0.44, 0.45, and 0.45
for class specific, subclass specific, and class agnostic situations.

Figure 4 shows that subclass specific situations slightly outperform
class specific situations. This is because subdivision into subclasses
by clustering yields more specialized object boundary detectors, which
are especially helpful when the object class can occur in different
contexts. Indeed, looking at performance increase of individual
classes, the use of subclasses yields an increase in AP of 0.04, 0.08,
and 0.14 for respectively killer-whale, airship, and basketball. The
variety of contexts of the killer-whale can be seen in Figure 3b,
airships occur on the ground and against the sky, while basketball
images range from basketball close-ups, to indoor competition (see
Figure 5), to outdoor play.

Note that the monolithic boundary detector is trained exclusively on
the objects of interest. Hence if a local image patch causes a false
boundary prediction, it is necessarily similar in appearance to a
local image patch of a true object boundary. Now notice in Figure 5
that monolithic boundary detection fires on many non-object boundary
edges: the crowd of the basketball player, the shade behind the dog,
the dog’s internal boundaries, and the water of the killer-whale.
Therefore such background edges are necessarily similar in appearance
to true object boundaries. This means a monolithic approach can never
work well in all situations.

Figure 4. Performance of object boundary detection on ImageNet.
Situational object boundary detection significantly outperforms
monolithic. The black line is occluded by the blue.

In contrast, situational object boundary detection performs much
better, especially when using subclass specific situations. On the
basketball image, our method ignores not only the crowd but also the
player, which is good since the player is not the object of interest.
For the dog our method focuses primarily on the dog boundaries,
ignoring shadow and its interior boundaries. For the killer-whale
spurious edges caused by the water are ignored.

We conclude that by using object boundary detectors specialized for
the identified situation, we effectively constrain the expected local
appearance of object boundaries, which helps resolve ambiguities. This
yields significant improvements: whereas a monolithic approach results
in 0.260 AP, our subclass specific situations yield 0.296 AP, a
relative improvement of 14%.

CNNs vs Fisher Vectors. Table 2 shows that CNN features work generally
better than Fisher vectors for situational object boundary detection.
This confirms other observations on the strength of CNN features (e.g.
[7, 11, 33]). For class-agnostic situations improvements are
especially good since it improves both the creation of situations and
the classification. We use CNN features for the remainder of this
paper.

Using ground-truth image labels. Table 2 includes an experiment where
we use the ground-truth label to determine which class-specific
boundary detector should be applied (GT - class specific). This helps
assessing the quality of the global image appearance classifier within
our framework. As the table shows, there is almost no difference
between GT - class specific and CNN - class specific. Hence within our
framework global image classification achieves what can be maximally
expected from it.

Figure 5. Qualitative comparison for monolithic versus situational object boundary detection. The upper row shows object boundary
predictions. The lower row shows the ground truth boundaries and evaluation at 50% recall, with true positives in green, false positives
in red, and undetected boundaries in grey. Monolithic boundary detection fires on many false object boundaries caused by the background
and internal boundaries, while situational object boundary detection focuses much better on the boundaries of the object of interest.

CNN features inside the Random Forest. Theoretically, the Random
Forests can learn from any features of different modalities. So it
would arguably be simpler to directly provide global image features to
the Structured Edge Forests and bypass the intermediate step of
classifying images into situations. We tried this with CNN features,
which are stronger and have a lower dimensionality than Fisher
vectors. We name this setting monolithic - CNN enhanced. Table 2 shows
that this does not work better than the baseline monolithic detector.

4.2. Microsoft COCO 

Dataset. Microsoft COCO [24] provides accurate segmentations for its
80 object classes such as person, banana, bus, cat, and others. We use
v0.9 consisting of 82,783 training and 40,504 validation images.
Images contain on average 7.7 different object classes. Since
evaluation of boundary predictions is relatively slow by necessity
[29], we limit evaluation to the first 5,000 images of the validation
set (which comes already randomized).

Number of situations. For our subclass specific situations, we choose
10 subclasses per class, leading to a total of 800 situations. We also
use 800 class agnostic situations.


Results. In contrast to the previous experiment, here most images
contain multiple object classes. Now the first question is: should we
train (sub)class specific object boundary detectors on only the object
boundaries of the target class or on the boundaries of all object
classes present in the image? Results are shown in Table 3.
Interestingly, results are slightly better for true single class
object boundary detectors in the theoretical setting where we use the
Ground Truth to determine the class label (GT - class specific). In
contrast, when using CNN features results are slightly better when the
detectors are trained on all object boundaries in the images. This
suggests that mistakes made by object classification can be partially
amended by having object boundary predictors specialized to a certain
context rather than to a certain object class. For the rest of this
paper, we therefore train situational object boundary detectors always
on all object boundaries present in the images of a situation.

Figure 6 compares situational object boundary detection with the
monolithic baseline. Whereas a monolithic approach yields an AP of
0.368, our situational approaches yield a substantially higher AP of
0.408, 0.424, and 0.434 for respectively class specific, subclass
specific, and class agnostic situations. The best AP improvement is
almost 0.07 for class agnostic situations.

As before, subclass specific situations outperform class specific
situations. But unlike ImageNet, on COCO the class agnostic situations
slightly outperform the subclass specific. This is likely because in
our ImageNet subset only a single class is annotated, whereas COCO
images often contain multiple classes. The fact that class agnostic
situations are superior suggests that the whole context of the image
is more important for determining which object boundaries to expect
than the specific object classes depicted.

Figure 6. Performance of object boundary detection on the first
5000 images of the COCO validation set.

Detectors trained on class object boundaries only
                               precision at    precision at    average
                               20% recall      50% recall      precision
GT - class specific            0.566           0.460           0.422
CNN - class specific           0.543           0.443           0.407
CNN - subclass specific        0.560           0.459           0.418

Detectors trained on all object boundaries within images
                               precision at    precision at    average
                               20% recall      50% recall      precision
monolithic                     0.494           0.406           0.368
GT - class specific            0.556           0.454           0.416
CNN - class specific           0.544           0.446           0.408
CNN - subclass specific        0.567           0.465           0.424
CNN - class agnostic           0.578           0.474           0.434

Table 3. Results on Microsoft COCO. Situational object boundary
detection significantly outperforms a monolithic strategy.

Figure 7 shows qualitative results. In contrast to a monolithic
approach, our situational object boundary detector correctly ignores
grass/gravel transitions in baseball, contours of buildings (which are
not objects of interest) in streets, and interior boundaries of the
train.

We conclude that by identifying a situation, we can avoid many false
positive object boundary predictions made by a monolithic detector.
This leads to significant improvements: whereas a monolithic approach
yields 0.368 AP, class agnostic situations yield 0.434 AP, a relative
improvement of 18%.

4.3. Pascal VOC 2012 segmentation 

Dataset. We use the 1,464 training and 1,449 validation 
images of Pascal VOC 2012 segmentation, annotated with 
contours for 20 object classes for all instances in all images. 

Number of situations. Since the dataset is a lot smaller than
Microsoft COCO, we choose to have 5 subclasses per class to still have
sufficient training data per situation, leading to 100 subclass
specific situations. For fair comparison, we also cluster 100 class
agnostic situations.

Figure 8. Performance of object boundary detection on the Pascal
VOC 2012 segmentation database.

                               precision at    precision at    average
                               20% recall      50% recall      precision
monolithic                     0.514           0.433           0.396
GT - class specific            0.576           0.470           0.430
CNN - class specific           0.573           0.469           0.426
CNN - subclass specific        0.582           0.475           0.426
CNN - class agnostic           0.578           0.472           0.422

Table 4. Results on validation of Pascal VOC 2012 segmentation.

Results. Results are presented in Figure 8 and Table 4. Again, with
0.426 AP the situational object boundary detection significantly
outperforms the monolithic performance of 0.396 AP. This is a relative
8% improvement.

On this dataset, class specific situations have about the same
performance as subclass specific and class agnostic. This differs from
ImageNet and COCO, most likely because the training set is smaller.
Hence fine-grained situations yield fewer benefits, since both
training appearance based classifiers and training object boundary
detectors are more difficult with less data.

4.4. Semantic Boundaries Dataset (SBD) 

In some applications one may want to do ‘semantic contour detection’
[17], i.e. generating class-specific object boundary maps. Our
class-specific boundary detectors can produce such maps
$\mathcal{D}_c(I)$, specific to class $c$, using (1) but with the
summation running only over class $j = c$:

$$\mathcal{D}_c(I) = P(S_c \mid I) \cdot D_c(I) \qquad (3)$$

where $P(S_c \mid I)$ is the probability that class $c$ occurs in
image $I$ according to CNN-based classification, and $D_c(I)$ is the
output of the class-specific boundary predictor for class $c$.
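
Equation (3) amounts to scaling each class-specific boundary map by the image-level probability of that class. A minimal sketch, assuming boundary maps are nested lists and using the hypothetical function name `semantic_contour_maps`:

```python
from typing import Dict, Hashable, List

BoundaryMap = List[List[float]]


def semantic_contour_maps(class_probs: Dict[Hashable, float],
                          class_maps: Dict[Hashable, BoundaryMap]) -> Dict[Hashable, BoundaryMap]:
    """Scale every class-specific boundary map by P(S_c | I), as in Equation (3)."""
    return {c: [[class_probs[c] * value for value in row] for row in boundary_map]
            for c, boundary_map in class_maps.items()}
```

A class that the image classifier considers unlikely therefore contributes only weak contours, even if its boundary detector fires strongly.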

We use the Semantic Boundaries Dataset of [17], which consists of
11,318 images from the Pascal VOC 2011 trainval dataset, divided into
8498 training and 2820 test images. All instances of its 20 object
classes were annotated with accurate figure/ground masks by
crowdsourcing. We use the official evaluation software provided by
[17].
Figure 7. Examples from COCO. Odd rows: input image and boundary predictions. Even rows: ground truth boundaries and precision at a
recall of 50%. True positives are green, false positives red, and undetected boundaries grey. While the monolithic detector often incorrectly
fires on the background and internal boundaries, our situational object boundary detectors focus better on true object boundaries.



Figure 9. Semantic Contour Detection on SBD. [17] versus our
CNN-based class specific situational object boundary detector.


As Figure 9 shows, our method considerably outperforms [17] on most
classes. While [17] report a mean AP of 0.207, we obtain 0.316 mAP.

4.5. Computational Requirements 

Runtime on a test image is essentially constant for any reasonable
number of situations: the most expensive component is the boundary
detector [10], which takes 73 ms/image on an Intel Core i5-3470. At
test time we always apply n = 5 detectors (Equation (2)). Extracting
CNN features takes about 2 ms/image on a modern GPU. Linear
classification on 4096 dimensions takes less than 2 ms/image for 1000
situations. Hence our situational object boundary prediction takes
around 0.37 s/image, which is still very fast for an object boundary
detector (see e.g. [10]).
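
The total follows directly from the component timings quoted above. The short check below uses only those numbers; the function name is hypothetical, and the classification time is taken at its stated upper bound of 2 ms.

```python
def situational_runtime_ms(detector_ms: float = 73.0, num_detectors: int = 5,
                           cnn_ms: float = 2.0, classifier_ms: float = 2.0) -> float:
    """Sum of n detector passes, CNN feature extraction, and linear classification."""
    return num_detectors * detector_ms + cnn_ms + classifier_ms


total_ms = situational_runtime_ms()
print(f"{total_ms:.0f} ms/image, i.e. about {total_ms / 1000:.2f} s/image")
```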

5. Conclusion 

The appearance of true object boundaries varies from situation to
situation. Hence a monolithic object boundary prediction approach
which predicts object boundaries regardless of the image content is
necessarily suboptimal. Therefore this paper introduces situational
object boundary detection. First the situation is determined based on
global image appearance. Afterwards only those boundary detectors are
applied which are specialized for this situation. Since we build on
[10], our situational object boundary prediction is fast and takes
only 0.37 s/image. More importantly, results on object boundary
detection show consistent improvements on three large datasets: on
Pascal VOC 2012 segmentation [13], the automatically segmented
ImageNet [16, 35], and Microsoft COCO [24], we obtained relative
improvements of respectively 8%, 14% and 18% AP. Furthermore, on
semantic contour detection our approach substantially outperforms [17]
on their SBD dataset.
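
The relative improvements quoted above can be recomputed from the monolithic and best situational AP values reported in Tables 2 to 4. The helper name `relative_improvement` is hypothetical; the numbers are taken from the tables.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Improvement over the baseline, as a fraction of the baseline."""
    return (improved - baseline) / baseline


# (monolithic AP, best situational AP) per dataset, from Tables 4, 2, and 3
results = {
    "Pascal VOC 2012": (0.396, 0.426),
    "ImageNet": (0.260, 0.296),
    "Microsoft COCO": (0.368, 0.434),
}

for dataset, (monolithic, situational) in results.items():
    print(f"{dataset}: {100 * relative_improvement(monolithic, situational):.0f}%")
```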